Abstract:
Increasing communication overheads are already threatening computer system scaling. One approach to dramatically reduce communication overheads is waferscale processing. However, waferscale processors have historically been deemed impractical due to yield issues inherent to conventional integration technology. Emerging integration technologies such as Silicon-Interconnection Fabric (Si-IF), where pre-manufactured dies are directly bonded onto a silicon wafer, may enable one to build a waferscale system without the corresponding yield issues. As such, waferscale architectures need to be revisited. In this paper, we study whether it is feasible and useful to build today's architectures at waferscale. Using a waferscale GPU as a case study, we show that while a 300 mm wafer can house about 100 GPU modules (GPMs), only a much scaled-down GPU architecture with about 40 GPMs can be built when physical concerns are considered. We also study the performance and energy implications of waferscale architectures. We show that waferscale GPUs can provide significant performance and energy efficiency advantages (up to 18.9x speedup and 143x energy-delay product (EDP) benefit compared against an equivalent MCM-GPU based implementation on PCB) without any change in the programming model. We also develop thread scheduling and data placement policies for waferscale GPU architectures. Our policies outperform state-of-the-art scheduling and data placement policies by up to 2.88x (average 1.4x) and 1.62x (average 1.11x) for the 24 GPM and 40 GPM cases, respectively. Finally, we build the first Si-IF prototype with interconnected dies. We observe 100% of the inter-die interconnects to be successfully connected in our prototype, which, coupled with the high yield reported previously for bonding of dies on Si-IF, demonstrates the technological readiness for building a waferscale GPU architecture.
Abstract:
Demand for increasing performance is far outpacing the capability of traditional methods for performance scaling. Disruptive solutions are needed to advance beyond incremental improvements. Traditionally, processors reside inside packages to enable PCB-based integration. We argue that packages reduce the potential memory bandwidth of a processor by at least one order of magnitude, allowable thermal design power (TDP) by up to 70%, and area efficiency by a factor of 5 to 18. Further, silicon chips have scaled well while packages have not. We propose packageless processors: processors whose packages have been removed and whose dies are directly mounted on a silicon board using a novel integration technology, Silicon Interconnection Fabric (Si-IF). We show that Si-IF-based packageless processors outperform their packaged counterparts by up to 58% (16% average), 136% (103% average), and 295% (80% average) due to increased memory bandwidth, increased allowable TDP, and reduced area, respectively. We also extend the concept of packageless processing to the entire processor and memory system, where the area footprint reduction was up to 76%.
Abstract:
On-chip memory dominates die area for most processor and logic designs. Over the last several years, embedded DRAM has become successful in replacing conventional SRAM for many of these applications. We will outline the key requirements of embedded memory and the tradeoffs between SRAM and DRAM for these applications. As scaling has progressed into the 45 nm regime and beyond, SRAM scaling has become more difficult, and it has become difficult to reduce SRAM voltages due to instabilities that develop as voltage is lowered. Embedded DRAMs offer a 3-4X density advantage, a better than 5X standby power advantage, and a significantly higher resistance to soft error upsets. In the future, three-dimensional integration (3Di) of circuits is an attractive approach to stay on the semiconductor productivity roadmap. Briefly, the principal value of 3D integration lies in increasing the volumetric transistor density, with the potential benefit of shorter electrical path lengths through use of the shorter third dimension. However, as we study this problem in some detail, we uncover a serious set of constraints, especially for high-performance applications. These constraints can be classified into four broad categories: the overhead and design constraints of through-silicon vias (TSVs); power delivery and distribution in multiple strata; heat dissipation across the 3D stack; and finally, reparability of the 3D stack. In this talk, we will examine these constraints in detail based on our experience with high-end processors.
Abstract:
Electrostatic discharge (ESD) failure accounts for about 35% of IC field returns and causes a several-billion-dollar loss to the semiconductor industry. An on-chip ESD detector can help track the electrostatic history of ICs from manufacturing to end-of-life. Two approaches for on-chip ESD detection are presented: a variable dielectric width capacitor and a vertical MOSCAP array. The variable dielectric width capacitor approach employs metal plates terminated with sharp corners to enhance the local electric field and facilitate easy breakdown of the thin dielectric between the metal plates. The vertical MOSCAP array consists of a capacitor array connected in series. Both approaches were simulated, fabricated, and experimentally characterized in GlobalFoundries 22 nm fully depleted silicon-on-insulator. Vertical MOSCAP arrays detect ESD events starting from ~6 V with 6 V granularity, while the variable dielectric width capacitor is suitable for detection of high ESD voltages of 40 V and above.
Abstract:
In this paper, we describe the performance and power benefits of our Fine Pitch integration scheme on a Silicon Interconnect Fabric (Si-IF). Here we propose a Simple Universal Parallel intERface (SuperCHIPS) protocol enabled by fine pitch dielet to interconnect fabric assembly. We show that dramatic improvements in bandwidth, latency, and power are achievable through our integration scheme, where small dielets (1-25 mm²) are attached to a rigid Silicon Interconnect Fabric (Si-IF) at fine interconnect pitch (2-10 μm) and short inter-die distance (50-500 μm) using solderless metal-to-metal thermal compression bonding (TCB). Our simulations show that links in the Si-IF with short wire-lengths (5-25x improvement in data bandwidth. This can improve system performance (>20x) when compared to PCB-style integration and may even approach single die SoC metrics in some cases. Furthermore, our protocol is simple and non-proprietary. We show that this scheme enables heterogeneous system integration using a dielet-based assembly method and provides a significant reduction in design and validation cost. System-level analysis of the heterogeneous integration scheme promises power benefits of more than 15% even for very small systems.